Bag-of-Embeddings for Text Classification
Abstract
Words are central to text classification. It has been shown that simple Naive Bayes models with word and bigram features can give highly competitive accuracies when compared to more sophisticated models with part-of-speech, syntactic and semantic features. Embeddings offer distributional features about words. We study a conceptually simple classification model that exploits multi-prototype word embeddings based on text classes. The key assumption is that words exhibit different distributional characteristics under different text classes. Based on this assumption, we train multi-prototype distributional word representations for different text classes. Given a new document, its text class is predicted by maximizing the probabilities of the embedding vectors of its words under the class. On two standard classification benchmark datasets, one balanced and the other imbalanced, our model outperforms state-of-the-art systems in both accuracy and macro-averaged F1 score.
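The decision rule described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the probability model, so scoring each word by the log-sigmoid of the dot product between its class-specific vector and the mean of its neighbours' context vectors is an assumption. All function names, the window size, and the toy vectors are hypothetical.

```python
import math


def log_sigmoid(x: float) -> float:
    # Numerically stable log(1 / (1 + exp(-x))).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def class_log_score(doc, class_vectors, context_vectors, window=2):
    """Sum over words of log p(word vector | context) under one class.

    ASSUMPTION: p is modelled as sigmoid(class vector . mean context vector).
    """
    total = 0.0
    for i, word in enumerate(doc):
        if word not in class_vectors:
            continue  # skip out-of-vocabulary words
        start = max(0, i - window)
        neighbours = [
            w
            for j, w in enumerate(doc[start:i + window + 1], start=start)
            if j != i and w in context_vectors
        ]
        if not neighbours:
            continue  # no usable context for this word
        dim = len(class_vectors[word])
        mean_ctx = [
            sum(context_vectors[w][k] for w in neighbours) / len(neighbours)
            for k in range(dim)
        ]
        total += log_sigmoid(dot(class_vectors[word], mean_ctx))
    return total


def predict(doc, embeddings_by_class, context_vectors, window=2):
    # Pick the class whose word prototypes make the document most probable.
    return max(
        embeddings_by_class,
        key=lambda c: class_log_score(
            doc, embeddings_by_class[c], context_vectors, window
        ),
    )


# Toy data (hypothetical): one prototype per word per class.
context_vectors = {
    "goal": [1.0, 0.0],
    "match": [1.0, 0.0],
    "stock": [0.0, 1.0],
    "market": [0.0, 1.0],
}
embeddings_by_class = {
    "sports": {
        "goal": [2.0, -2.0],
        "match": [2.0, -2.0],
        "stock": [1.0, -1.0],
        "market": [1.0, -1.0],
    },
    "finance": {
        "goal": [-2.0, 2.0],
        "match": [-2.0, 2.0],
        "stock": [-1.0, 1.0],
        "market": [-1.0, 1.0],
    },
}

sports_label = predict(["goal", "match"], embeddings_by_class, context_vectors)
finance_label = predict(["stock", "market"], embeddings_by_class, context_vectors)
```

Working in log space turns the product of per-word probabilities into a sum, which avoids underflow on long documents. In practice the class-specific vectors would be learned from labelled training text rather than written by hand.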
Related resources
A New Document Embedding Method for News Classification
Text classification is one of the main tasks of natural language processing (NLP). In this task, documents are classified into pre-defined categories. A large volume of news spreads on the web. A text classifier can categorize news automatically, which facilitates and accelerates access to it. The first step in text classification is to represent documents in a suitable way t...
From Image to Text Classification: A Novel Approach based on Clustering Word Embeddings
In this paper, we propose a novel approach for text classification based on clustering word embeddings, inspired by the bag of visual words model, which is widely used in computer vision. After each word in a collection of documents is represented as a word vector using a pre-trained word embedding model, a k-means algorithm is applied to the word vectors in order to obtain a fixed-size set of c...
Large Scale Multi-label Text Classification with Semantic Word Vectors
Multi-label text classification has been applied to a multitude of tasks, including document indexing, tag suggestion, and sentiment classification. However, many of these methods disregard word order, opting to use bag-of-words models or TFIDF weighting to create document vectors. With the advent of powerful semantic embeddings, such as word2vec and GloVe, we explore how word embeddings and wo...
Bag of Region Embeddings via Local Context Units for Text Classification
Contextual information and word order have proved valuable for text classification. To make use of local word order information, n-grams are commonly used as features in several models, such as linear models. However, these models commonly suffer from data sparsity and have difficulty representing large regions. The discrete or distributed representations of n-grams can be regarded a...
Proximity-based Graph Embeddings for Multi-label Classification
In many real applications of text mining, information retrieval and natural language processing, large-scale features are frequently used, which often makes the employed machine learning algorithms intractable, leading to the well-known “curse of dimensionality” problem. Aiming at not only removing the redundant information from the original features but also improving their discriminating abili...
Comparison of Short-Text Sentiment Analysis Methods for Croatian
We focus on the task of supervised sentiment classification of short and informal texts in Croatian, using two simple yet effective methods: word embeddings and string kernels. We investigate whether word embeddings offer any advantage over corpus- and preprocessing-free string kernels, and how these compare to bag-of-words baselines. We conduct a comparison on three different datasets, using diff...